Using RSelenium and Docker To Webscrape In R - Using The WHO Snake Database

Webscraping In R with RSelenium - Extracting Information from the WHO Snake Antivenom Database

Making Excuses

Looking back at this post it seems a bit like how to draw an owl.

But! The complicated parts towards the end are very much the specifics of how to download the WHO database. What I wanted to share were the basics of RSelenium installation and use. Hopefully this helps if you were thinking of doing something similar.

So if you're reading this not to get ideas of how to scrape with RSelenium, but instead to use the WHO Snake Database in R, I've put data frames for snake species and country data into a package called snakes, which you can download from GitHub:

devtools::install_github("callumgwtaylor/snakes")

Intro

Recently I was looking at the WHO Snake Antivenom Database in R, and had to use rvest and purrr from the tidyverse to get the information from the database in a tidy format. This worked well enough, as each snake's information was on its own page, in the same format each time. I've put it online here

My next aim was to extract the snake country data, looking at which snake species were present.

However, there was a problem. The information about a country's snakes would be split across multiple pages, with only ten snakes per page. The links through to the remainder of the snake info were little JavaScript links, so there wasn't a particular URL I could tell rvest to go to. I couldn't work out whether rvest could handle any JavaScript, and it seemed like I needed to try a different approach.

I needed a package that could load up a page, click through a javascript generated link, then download the information generated on that page. Turns out I needed RSelenium.

This post documents how I installed and used RSelenium to extract information from the WHO Snake Antivenom Database. It’s definitely not done in a “best practices” way, but it should allow you to get to a point where you’re loading sites in RSelenium, and downloading the information for use in R.

Installation

Basically:

  • Selenium is a set of programming tools, a framework, that allows you to automate web browser actions.
  • RSelenium is an R package that allows you to use your separate installation of Selenium from inside R.
  • Docker is software that runs applications in isolated environments (containers); we'll run Selenium inside one.

What to install?

  • Docker
  • RSelenium

Also, this assumes that you're using RStudio and have some understanding of R. By "some understanding" I mean: you may have to look up how to do things covered in R for Data Science, but most of the time when you read it, it makes sense.

Getting it to work

So the workflow here is:

Docker is installed and running. In Docker for Windows, you'll see an icon in the taskbar that you can hover over; it should state "Docker is running".

In the terminal you start a docker container of the selenium chrome browser. Do this any way you like. If, like me, you aren't really used to the terminal, you can get to it through RStudio (Terminal, NOT Console). Running:

docker run -d -p 4445:4444 selenium/standalone-chrome

This will set up the docker container we need.

If you then run:

docker ps

You should get a printout of a table containing info about the docker container that's just been set up.

Once that’s all working, the rest is back in R.

Load up RSelenium:

library(RSelenium)

And access our selenium browser using the RSelenium package:

remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()

Here what we’re doing is creating an object in R that contains the information about the selenium browser we’ve created in a docker container. Then we’re opening the browser.

Using RSelenium Plus Rvest To Scrape The WHO database

So what we're going to do here is use RSelenium to identify and navigate to the correct page, then a mishmash of xml2 and rvest to download the information on that individual page. Lastly we'll put everything we've done into a mix of functions, allowing us to use purrr to automate going through the entire site.

Part One - Open Browser and Confirm We’re At The Landing Page

# So we've loaded docker up already.
# In the terminal we've run: docker run -d -p 4445:4444 selenium/standalone-chrome

library(RSelenium)
library(rvest)
library(xml2)
library(tidyverse)

remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                      port = 4445L,
                      browserName = "chrome")
remDr$open()

remDr$navigate("http://apps.who.int/bloodproducts/snakeantivenoms/database/SearchFrm.aspx") #Entering our URL gets the browser to navigate to the page

remDr$screenshot(display = TRUE) #This will take a screenshot and display it in the RStudio viewer

Part Two - Selecting The Country Dropdown Menu, and Navigating To A Particular Country

Most articles I've read recommend using Selector Gadget to identify which part of a site you are trying to interact with. When it works, it works really well, but it didn't want to load on the WHO database.

Instead, I would right click on the page and use the “Inspect” option to identify the correct CSS path for RSelenium to use.

Creating A List Of Countries

snake_countries <- xml2::read_html(remDr$getPageSource()[[1]]) %>%
  rvest::html_nodes("#ddlCountry") %>%
  rvest::html_children() %>%
  rvest::html_text() %>%
  dplyr::data_frame(country_name = .)

snake_countries <- snake_countries %>%
  dplyr::mutate(list_position = 1:160,
                x = stringr::str_c("#ddlCountry > option:nth-child(",list_position, ")"))

# We chop off our first one as we are never going to navigate to there
snake_countries <- snake_countries[-1,]

So using the RSelenium getPageSource() function, and selecting the first element remDr$getPageSource()[[1]], we can use xml2::read_html() to extract the html from the loaded page.

Using rvest::html_nodes() we’ve selected the chunk that we identified earlier with Inspect. Using rvest::html_children() we can extract attributes from the countries chunk, and then using rvest::html_text() we can extract a list of names of countries, which we turn into a column in a data_frame.
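If you want to try that chain of calls without the live page, it can be run on an inline HTML string instead. This is just a sketch: it assumes rvest is installed, and the dropdown contents here are invented stand-ins for the real country list.

```r
library(rvest)

# A tiny stand-in for the WHO page's country dropdown
html <- minimal_html('<select id="ddlCountry">
<option>Select country/territory</option>
<option>Afghanistan</option>
<option>Albania</option>
</select>')

# Same pipeline as on the live page: node -> children -> text
countries <- html %>%
  html_node("#ddlCountry") %>%
  html_children() %>%
  html_text()

countries[2]
# [1] "Afghanistan"
```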

This data_frame is expanded with a column showing how far down the dropdown menu something is, plus a column “x”. x is a bit more complicated. What we’ve done is created the css address for each option in the drop down menu using stringr. This address will be used later when we want to go to a specific country.
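The address-building step can be sketched on its own. Here paste0() stands in for stringr::str_c() so the sketch runs in base R, and the positions are made up:

```r
# Build a CSS selector for each position in the dropdown menu
list_position <- 1:4
x <- paste0("#ddlCountry > option:nth-child(", list_position, ")")

x[2]
# [1] "#ddlCountry > option:nth-child(2)"
```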

We chop off the first row in our new data_frame as it’s some rubbish “Select country/territory” text that we don’t need.

Navigating to a specific country can be done manually, using the addresses we've generated:

element<- remDr$findElement(using = 'css selector', "#ddlCountry > option:nth-child(65)")
element$clickElement()

# Printing 
html <- xml2::read_html(remDr$getPageSource()[[1]])
xml2::write_html(html, "india.html")

The new thing we're doing here is navigating with RSelenium to a specific part of the dropdown menu, using the CSS address we've got for a particular item (India) and findElement(). Then simply clicking on it!

We can extract the html from the results and compare what we’ve got with what’s seen on the official website.

So none of the prettiness, but all the information we wanted!

Part Three - Finding The Separate Pages Of Snakes

Using Inspect in our new tab, it's clear that the name of the table we want is SnakesGridView. For countries with more than ten snakes, the first table will have twelve rows, and that twelfth row (the reason we're talking about tr:nth-child(12)) contains the links to the subsequent pages. This was all found through rummaging with Inspect.

element <- remDr$findElement(using = 'css selector', "#SnakesGridView > tbody > tr:nth-child(12) > td > table > tbody > tr > td:nth-child(2)")
element$clickElement()
remDr$screenshot(display = TRUE)

Part Four - Download One Country

Success!

Now there’s just a couple of things left to do.

  • Download the snake information from the first page of a country profile and store it as a dataframe.
  • Identify whether there is a second/third/fourth page for the profile.
  • Go to these pages and download them.

Working out whether a country has more than ten snakes is easy enough, thanks to the way the tables have been formatted. If a country has a single page, the html table created by rvest has four columns. If it has multiple pages, the table has six, as the links at the bottom mess things up. So we can wrap all the multiple-pages work in an if statement.
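The column-count check can be sketched with dummy tables (the column names here are invented):

```r
# A single-page country parses to four columns; on a multi-page country
# the pager row adds extra columns when rvest fills the table.
single_page <- data.frame(cat = 1, common = "a", species = "b", link = NA)
multi_page  <- cbind(single_page, extra1 = NA, extra2 = NA)

# length() of a data frame is its number of columns
length(single_page) > 4
# [1] FALSE
length(multi_page) > 4
# [1] TRUE
```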

This chunk below is a bit convoluted. I’ve tried to comment on it to make it clearer. The steps basically are:

  • Load the country page
  • Download the snakes table for that country (if there’s only one page, that’s all we need to do)
  • Work out if there are more pages for the country
  • If there are additional pages, then do the following:
  • Work out the directions to get to the additional pages
  • Write a function to download from a single additional page
  • Use purrr::pmap to use those directions and the function to download all the additional pages
  • Put all the additional pages for one country together.
  • Reformat it so it looks the same as countries that just have one page
# Then Extract The Snake Page
country_html <- xml2::read_html(remDr$getPageSource()[[1]])

# We download The Table For the First Page, and if there is only one page that's all we need to do!

country_table <- country_html %>%
  rvest::html_node("#SnakesGridView") %>%
  rvest::html_table(fill = TRUE)

# Then Determine If There Are More Pages
more_pages <- length(country_table) > 4

# Create a function to download the additional page's information

snake_country_secondary_download <- function(page_element){
  
  element <- remDr$findElement(using = 'css selector', page_element)
  element$clickElement()
  
  country_html <- xml2::read_html(remDr$getPageSource()[[1]])
  
  secondary_country <- country_html %>%
  rvest::html_node("#SnakesGridView") %>%
  rvest::html_table(fill = TRUE)
  
  secondary_country
}

# Put everything about these bigger pages in an if statement

if(more_pages == TRUE){
  # Then Work Out Exactly How Many More Pages - This is messy, I don't know why it can't go into a single html_node
  # but it seems to work this way and not the other way.

  country_table_number <- country_html %>%
  rvest::html_node("#SnakesGridView") %>%
  rvest::html_node("tbody > tr:nth-child(12)") %>%
  rvest::html_node("td") %>%
  rvest::html_node("table") %>%
  rvest::html_node("tr") %>%
  rvest::html_nodes("td") %>%
  length()
  
  # Create the links for these secondary pages
  country_pages <- dplyr::data_frame(
    page_number = 1:country_table_number,
    page_element = stringr::str_c("#SnakesGridView > tbody > tr:nth-child(12) > td > table > tbody > tr > td:nth-child(", page_number, ")"))
  
  country_pages <- country_pages[-1,]
  country_pages <- country_pages[,2]

    #country_pages left is the data_frame containing the address for the links for all the subsequent pages for each individual country
    
    #use purrr::pmap to run through each of the secondary pages with our function we created earlier. Then merge them all together
  secondary_country <- purrr::pmap(country_pages, snake_country_secondary_download) %>%
  dplyr::bind_rows()
  
  # reformat these secondary pages, so they look the same as our pages where there's less than ten snakes

  country_table <- dplyr::bind_rows(country_table, secondary_country)
  country_table <- country_table[,1:4] %>%
    dplyr::filter(is.na(`Link*`))
}
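The last two lines of that chunk lean on the Link* column: pager rows carry text there, real snake rows have NA. A base R sketch of the same filter, with invented data:

```r
# Keeping only rows where Link* is NA drops the pager row, which is
# what the dplyr::filter(is.na(`Link*`)) step does.
country_table <- data.frame(
  `Species name` = c("Naja naja", "1 2 3"),
  `Link*`        = c(NA, "pager links"),
  check.names    = FALSE
)
kept <- country_table[is.na(country_table$`Link*`), ]

nrow(kept)
# [1] 1
```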

Part Five - Automate To Download All Countries

This section below is then close to identical to the section above. We’ve got it to work for one country. Now we just need to put it into a function, to allow us to work through every country.

The arguments the function takes are the CSS address of the country's entry in the dropdown menu (x) and the name of the country; the ... in the signature absorbs any extra columns (such as list_position) that purrr::pmap passes along.

snake_country_download <- function(x, country_name, ...){
  # Then Go To Our Country
  remDr$navigate("http://apps.who.int/bloodproducts/snakeantivenoms/database/SearchFrm.aspx") # First we load the database
  element <- remDr$findElement(using = 'css selector', x)
  element$clickElement() # Click the country's entry in the dropdown menu
  
  # Then Extract The Snake Page
  country_html <- xml2::read_html(remDr$getPageSource()[[1]])
  
  country_table <- country_html %>%
    rvest::html_node("#SnakesGridView") %>%
    rvest::html_table(fill = TRUE)
  
  # Then Determine If There Are More Pages
  more_pages <- length(country_table) > 4
  
  if(more_pages == TRUE){
    # Then Work Out Exactly How Many More Pages
    country_table_number <- country_html %>%
      rvest::html_node("#SnakesGridView") %>%
      rvest::html_node("tbody > tr:nth-child(12)") %>%
      rvest::html_node("td") %>%
      rvest::html_node("table") %>%
      rvest::html_node("tr") %>%
      rvest::html_nodes("td") %>%
      length()
    
    # Then create a data frame, with new address from us to download these from
    country_pages <- dplyr::data_frame(
      page_number = 1:country_table_number,
      page_element = stringr::str_c("#SnakesGridView > tbody > tr:nth-child(12) > td > table > tbody > tr > td:nth-child(", page_number, ")"))
    
    country_pages <- country_pages[-1,]
    country_pages <- country_pages[,2]
    
    # Then use a secondary function to download these
    secondary_country <- purrr::pmap(country_pages, snake_country_secondary_download) %>%
      dplyr::bind_rows()
    
    # Then, for countries that we had to download multiple pages, merge them all together
    country_table <- dplyr::bind_rows(country_table, secondary_country)
    country_table <- country_table[,1:4] %>%
      dplyr::filter(is.na(`Link*`))
  }
  
  country_table <- country_table %>%
    dplyr::mutate(country_name = country_name)
  
  message(country_name)
  country_table
}

Lastly, we use purrr::pmap() one more time to work through the country addresses we've already created, using the download function we just made.

snake_country_data <- purrr::pmap(snake_countries, snake_country_download) %>%
  dplyr::bind_rows()

snake_country_data <- snake_country_data %>%
  dplyr::select(country_name, snake_category = `Cat**`, snake_common_name = `Common name`, snake_species = `Species name`)

I used a few pieces to work out how to do this:

rvest article

rvest and purrr talk

the RSelenium vignettes

If you think that there are ways I should do this differently please let me know on twitter (@callumgwt) or on the github page for this package.

Same Disclaimer As Last Time

Lastly, this is clearly not my data and I make no claims of ownership whatsoever. The WHO are the copyright holders for any data, and whilst I think this package comes under acceptable use for research, please let me know if you're someone from the WHO who disagrees.

Actually lastly, this data is extracted from the WHO database, but I am not making any claims about their accuracy, so this information should not be used to make any clinical decision on use of antivenom, or any similar decisions.

New Package In Development: Snakes

tl;dr: snakes is a package containing data on individual snake species, from the World Health Organization Snake Database. You can get an early version with devtools::install_github("callumgwtaylor/snakes").

Recently for an assignment for my masters I was looking at the global burden of snakebite. Keen to get some pretty diagrams in to distract the reviewers from my terrible writing, I went online to find a resource we could analyse in R. This had worked well enough for previous essays, especially with the existence of The Humanitarian Data Exchange, plus a mini package I’d written for my own purposes to import their data easily.

However, after a brief search, the only online resource I could see was the WHO Snake and Antivenoms Database. Most organisations are keen for you to have a delve into what they have; recently I've used data from The Global Terrorism Database and the International Disasters Database. Both required a bit of registration but ultimately sent you a csv download.

The WHO Database however, looked like this: And once you selected a snake, you got this:

Now, I really like this set up; there's a clear layout of information for each snake on their individual page. You get a map and an image, and most importantly, which antivenoms exist. However, if you want to make comparisons between lots of snakes, that becomes a bit more difficult. If you want to explore which snakes don't have antivenoms, or the global distribution of antivenom producers, or anything else, then there's no straightforward way to do so without loading a lot of individual pages to take information down.

At this point, I realised that working out how to extract any info would definitely allow me to procrastinate beyond the essay deadline, so I gave up. But I did want to learn how to do some basic scraping in case I bumped into a similar issue in the future. Using this article on rvest, plus this talk on purrr and rvest, I've downloaded the data for each species of medically important snake from the WHO database.

Now if you load the snakes package from github you can save yourself the hassle of doing what I did:

devtools::install_github("callumgwtaylor/snakes")
library(snakes)
snake_species <- snakes::snake_species_data

snake_species_data will give you:

  • The Identifier the WHO uses for the snake species
  • Snake Common Name
  • Snake Species Name
  • Snake Family
  • A link to the WHO map of snake distribution
  • A link to the legend for the WHO map
  • A link to a WHO picture of the snake
  • The first other common name for the snake
  • Any other common names for the snake
  • Any previous names for the snake
  • A nested data_frame, containing the regions and subregions the snake is found in
  • A nested data_frame, containing the snake antivenom product names, manufacturers, and countries of origin

This is the first version of the package containing what I’ve put together today. I hope to update it with some more useful geographical information. The database does have individual country pages, so a data_frame including snakes in each country is the next target.

Lastly, this is clearly not my data and I make no claims of ownership whatsoever. The WHO are the copyright holders for any data, and whilst I think this package comes under acceptable use for research, please let me know if you're someone from the WHO who disagrees.

Actually lastly, this data is extracted from the WHO database, but I am not making any claims about their accuracy, so this information should not be used to make any clinical decision on use of antivenom, or any similar decisions.

The State of Cholera in Yemen in December

Cholera in Yemen : UPDATED 16-DECEMBER-2017

At the time of writing, the WHO has released information about 959,810 cases of cholera in Yemen, with 2,219 deaths. This post uses data released on 2017-11-26. This information has been collated by the Humanitarian Data Exchange (HDX), and put online. All the information below has been taken from HDX and read into R using hdxr. The code to run it all is in this rmarkdown document.

Cholera Cases

Total number of cases of cholera in each governorate in Yemen

Daily number of new cases of cholera in each governorate in Yemen

Cholera Deaths

Total number of deaths from cholera in each governorate in Yemen

Country Level

Cholera Deaths Table

Administrative District Deaths Cases
Hajjah 417 106933
Ibb 284 59932
Al Hudaydah 271 139145
Taizz 184 58223
Amran 174 94581
Dhamar 160 90560
Al Mahwit 148 56447
Sana’a 122 68453
Raymah 117 14497
Al Dhale’e 81 47004
Amanat Al Asimah 70 91799
Aden 62 20286
Abyan 35 28103
Al Bayda 33 26793
Al Jawf 22 14689
Lahj 21 22596
Marib 7 6897
Sa’ada 5 9722
Shabwah 3 1396
Hadramaut 2 587
Al Maharah 1 1167
Socotra NA NA

Cholera Cases Table

Administrative District Deaths Cases
Al Hudaydah 271 139145
Hajjah 417 106933
Amran 174 94581
Amanat Al Asimah 70 91799
Dhamar 160 90560
Sana’a 122 68453
Ibb 284 59932
Taizz 184 58223
Al Mahwit 148 56447
Al Dhale’e 81 47004
Abyan 35 28103
Al Bayda 33 26793
Lahj 21 22596
Aden 62 20286
Al Jawf 22 14689
Raymah 117 14497
Sa’ada 5 9722
Marib 7 6897
Shabwah 3 1396
Al Maharah 1 1167
Hadramaut 2 587
Socotra NA NA

The State of Cholera in Yemen in November

Cholera in Yemen : UPDATED 16-NOVEMBER-2017

At the time of writing, the WHO has released information about 913,741 cases of cholera in Yemen, with 2,196 deaths. This post uses data released on 2017-11-08. This information has been collated by the Humanitarian Data Exchange (HDX), and put online. All the information below has been taken from HDX and read into R using hdxr. The code to run it all is in this rmarkdown document.

Cholera Cases

Total number of cases of cholera in each governorate in Yemen

Daily number of new cases of cholera in each governorate in Yemen

Cholera Deaths

Total number of deaths from cholera in each governorate in Yemen

Country Level

Cholera Deaths Table

Administrative District Deaths Cases
Hajjah 414 100850
Ibb 282 57136
Al Hudaydah 268 131827
Taizz 184 54422
Amran 170 89729
Dhamar 157 84741
Al Mahwit 145 52920
Sana’a 122 66086
Raymah 116 13903
Al Dhale’e 81 46721
Amanat Al Asimah 68 87578
Aden 62 19816
Abyan 35 27957
Al Bayda 31 25905
Al Jawf 22 13722
Lahj 21 22524
Marib 7 6102
Sa’ada 5 8662
Shabwah 3 1390
Hadramaut 2 586
Al Maharah 1 1164
Socotra NA NA

Cholera Cases Table

Administrative District Deaths Cases
Al Hudaydah 268 131827
Hajjah 414 100850
Amran 170 89729
Amanat Al Asimah 68 87578
Dhamar 157 84741
Sana’a 122 66086
Ibb 282 57136
Taizz 184 54422
Al Mahwit 145 52920
Al Dhale’e 81 46721
Abyan 35 27957
Al Bayda 31 25905
Lahj 21 22524
Aden 62 19816
Raymah 116 13903
Al Jawf 22 13722
Sa’ada 5 8662
Marib 7 6102
Shabwah 3 1390
Al Maharah 1 1164
Hadramaut 2 586
Socotra NA NA

In Yemen, cases of cholera and deaths continue to rise

Cholera in Yemen : UPDATED 22-JULY-2017

At the time of writing, the WHO has released information about 368,207 cases of cholera in Yemen, with 1,828 deaths. This post uses data released on 2017-07-19. This information has been collated by the Humanitarian Data Exchange (HDX), and put online. All the information below has been taken from HDX and read into R using hdxr. The code to run it all is in this rmarkdown document.

Over the last fortnight the rate of new cases of cholera has slowed slightly, but we're still seeing more than 5,000 new cases daily. Whilst numbers of new diagnoses of cholera are dropping in regions like Sana'a, more badly affected governorates like Al Hudaydah show no signs of slowing down. The mortality rate is staying relatively static, with one in every 200 cases of cholera resulting in death.

According to Oxfam, the total number of cases of cholera in Yemen could double with the rainy season, to over 600,000. Keeping our current mortality rates, that's 3,000 deaths from a preventable illness.
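That projection is simple arithmetic: the projected case count multiplied by the observed case fatality rate of roughly one in 200.

```r
# Back-of-the-envelope: projected cases times the case fatality rate
projected_cases <- 600000
case_fatality   <- 1 / 200

projected_cases * case_fatality
# [1] 3000
```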

Cholera Cases

Total number of cases of cholera in each governate in Yemen

Daily number of new cases of cholera in each governate in Yemen

Cholera Deaths

Total number of deaths from cholera in each governate in Yemen

Country Level

Cholera Deaths Table

Administrative District Deaths Cases
Hajjah 353 38936
Ibb 233 28250
Al Hudaydah 212 45580
Taizz 161 25927
Amran 150 37814
Dhamar 124 26343

Cholera Cases Table

Administrative District Deaths Cases
Amanat Al Asimah 60 47647
Al Hudaydah 212 45580
Hajjah 353 38936
Amran 150 37814
Ibb 233 28250
Dhamar 124 26343

Cholera in Yemen is getting worse

Publicly available data about the cholera crisis in Yemen is provided through bulletins from the World Health Organisation. Released online, each one gives the most recent running total of cholera cases and deaths in each part of the country. You can get a copy yourself here

The code to recreate this map and the plot is available here

The map below shows the most recent numbers. Currently it’s the more densely populated areas of the West coast that are most badly affected, particularly Sana’a and Al Hudaydah.

However what this map doesn’t show easily, is that things seem to be getting worse.

If we take each bulletin's total number of cases and subtract the previous total (adjusting for the fact that we sometimes have to wait a few days for an update), it seems that the number of new cases of cholera is increasing in several areas. Al Hudaydah had around 250 cases per day at the start of the epidemic, but by the most recent bulletin we were seeing 1,000 new cases a day.
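The rate calculation described here is just a difference in cumulative totals divided by the days between bulletins; a minimal sketch with invented figures:

```r
# Daily new cases = change in cumulative cases / days between bulletins
bulletin_dates   <- as.Date(c("2017-06-20", "2017-06-24"))
cumulative_cases <- c(10000, 14000)

new_per_day <- diff(cumulative_cases) / as.numeric(diff(bulletin_dates))
new_per_day
# [1] 1000
```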

In fact, in the most recent bulletin, Al Hudaydah, Hajjah, and Al Dhale'e have all averaged over 1,000 new cases per day since the bulletin before.

Cholera in Yemen - mapping deaths from the current epidemic using HDX in R

Cholera in Yemen : UPDATED 12-JULY-2017

At the time of writing this, the WHO has released information about 320,199 cases of cholera in Yemen, with 1,742 deaths.

This information has been collated by the Humanitarian Data Exchange (HDX), and put online. I made a new function for hdxr the other day to make it easier to use maps from HDX, and wanted to learn more about what’s happening in Yemen.

All the information below has been taken from HDX and read into R using hdxr. The code to run it all is in this rmarkdown document

Cholera Deaths

Cholera Cases

Cholera Deaths Table

Administrative District Deaths Cases
Hajjah 338 35310
Ibb 227 25433
Al Hudaydah 199 38942
Taizz 150 22903
Amran 149 32625
Dhamar 114 20848

Cholera Cases Table

Administrative District Deaths Cases
Amanat Al Asimah 56 42765
Al Hudaydah 199 38942
Hajjah 338 35310
Amran 149 32625
Ibb 227 25433
Sana’a 111 24360

Downloading data from HDX easily in a tidy format - hdxr and hdx_resource_csv

An easier pipeline for Humanitarian Data Exchange, with a tidy format

I’ve been trying to make it easier to extract data from HDX in R, using tidyverse and ropensci packages. I’ve started compiling a mini-package called hdxr that wraps these pipelines up neatly.

I've added a new function today, hdx_resource_csv, which means that to download datasets from HDX all you need to do is the following:

library(hdxr)
hdx_connect()
datasets <- hdx_package_search(term = "data title") %>%
  hdx_resource_list() %>%
  hdx_resource_csv()

I’ve described most of these functions in previous posts, but basically:

hdx_connect uses ckanr to connect to the hdx ckan server

hdx_package_search will search hdx for the packages you’re looking for and return a dataframe. (You can use hdx_list to find titles of datasets)

hdx_resource_list will take that dataframe and use tidyverse features to extract information about the datasets themselves.

hdx_resource_csv

The new function hdx_resource_csv will take the results of hdx_resource_list, and return a new dataframe. This will have three columns:

  • identifier a title merged from the package and dataset titles
  • location the url where the csv was downloaded from
  • csv a nested dataframe downloaded from the location url

hdx_resource_csv provides a nested dataframe column to allow a sensible output when you download multiple csvs all with different columns in one go.

When you want to extract a particular csv to work with, you can use dplyr::filter() and tidyr::unnest() to get at it. You could just unnest without filtering, but when every csv has different column titles, the output is a bit messy.

Example:

library(tidyverse)
library(hdxr)
hdx_connect()
datasets <- hdx_package_search(term = "141121-sierra-leone-health-facilities") %>%
  hdx_resource_list() %>%
  hdx_resource_csv()

The above would give us a dataframe that looks like:

you can then select what you want with filter and unnest it:

sierra_leone_healthsites_sbtf_sle_health <- datasets %>%
  filter(dataset_identifier == "sierra-leone-healthsites_sbtf-sle-health") %>%
  unnest()
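The snippet above needs a live HDX connection; a base R analogue of the same filter-then-unnest idea, with invented identifiers and data, looks like this:

```r
# `csv` here is a list-column of data frames, mimicking the shape
# that hdx_resource_csv returns. All values are made up.
datasets <- data.frame(identifier = c("health-sites", "road-network"),
                       stringsAsFactors = FALSE)
datasets$csv <- list(data.frame(site = c("clinic", "hospital"),
                                stringsAsFactors = FALSE),
                     data.frame(road = "A1", stringsAsFactors = FALSE))

# "filter" to one dataset, then "unnest" by pulling out its data frame
wanted <- datasets$csv[[which(datasets$identifier == "health-sites")]]

wanted$site[1]
# [1] "clinic"
```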

Any use? Wrong way of doing things?

I think this pipeline makes sense for me to extract data from HDX. And the majority of files on there are CSV. If you’ve thoughts of how I should do it differently, please let me know either on github directly or just through twitter

If you want to try it yourself, download and installation instructions are on the github readme.

Using sf, gganimate, and the Humanitarian Data Exchange to map ACLED data for Africa

Mapping fatalities from violence in Africa in 2017

In my previous post I showed how I was trying to get data directly from the Humanitarian Data Exchange (HDX), in an R pipeline.

Here’s a quick example of what it allows you to do. The visualisations in this aren’t pretty, it’s mainly meant as an example of what you can easily do. I’m really enjoying using sf to plot maps in ggplot2, thanks to Matt Strimas-Mackey showing how to get that working.

The code to run this example is on my github in an R markdown file.

These graphs use two datasets, the ACLED files were downloaded directly from HDX. I then was able to filter out just events with fatalities, and just for 2017. The shapefiles are from the rnaturalearth package.

DISCLAIMER: I do not know how complete this ACLED dataset is, and do not want to pretend it paints a full and accurate picture about a subject I have no expertise in!

Static map of fatalities from political violence

gganimate plot of fatalities from political violence, separated monthly

Getting data from Humanitarian Data Exchange in a reproducible R pipeline

Update: I’ve updated a couple of these functions, as they were messy and unreliable (they probably still are!). I’ve also put them together as a mini-package that you can install from github: hdxr. The code below is the OLD version, the up to date versions are on github. If you’ve ideas on how I should do it better, please let me know on github/twitter.

I’ve been trying to make it easier for me to get information from the Humanitarian Data Exchange. The folk who run the centre have been trying to make it easier to access too, they’ve released a python api, and they use CKAN so you can download JSON information about each ‘package’ (basically the subject of the data) and each ‘resource’ (the data itself). The problem for me is I don’t know python, I’m not comfortable with CKAN, and struggle with JSON also.

All I wanted to be able to do was search for a package, and download a specific resource, in a reproducible and roughly tidy-ish fashion.

I've made a couple of functions to make it easier to do so, using ckanr and jsonlite.

So here are the functions I've made, and an example of using them:

Functions for interacting with the HDX CKAN

library(ckanr)
library(tidyverse)
library(jsonlite)

hdx_connect()

This will use the ckanr package to connect to the HDX ckan server.

# This creates a function to connect to the hdx server
hdx_connect <- function(){
  ckanr_setup(url = "http://data.humdata.org/")
}

hdx_list()

This function takes one argument, limit. It will return a list of HDX packages, depending on the limit set. There are currently almost 5,000 packages.

# This function will return a data frame listing available packages, up to the limit set
hdx_list <- function(limit){
  package_list(as = 'table', limit = limit) %>%
    as_data_frame(.)
}

hdx_resource_list()

To see the exact resources available in an easier-to-read dataframe they need unnesting. When this function is used on the results of hdx_package_search(), it will extract a resources dataframe, then left join the results onto the original dataframe provided to it.

# This function will take the results of a package_search and extract the resources, it will then link those resources to the results from package search giving a new data frame
hdx_resource_list <- function(package){
  package$resources %>%
    as.data.frame(.) %>%
    left_join(package, ., by = c("id" = "package_id")) %>%
    select(-resources)
}

Getting data off of HDX

Once you’ve used all the functions, you should have a dataframe with titles for the packages, and urls for the resources. You can then use httr or readr to download the files and bring them into your work.
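For plain CSV resources you may not even need httr; readr can read straight from a URL. A minimal sketch, assuming `resources` is a dataframe produced by the functions above, with a `format` column and a `hdx_rel_url` column holding the download link:

```r
library(tidyverse)

# Hypothetical: take the first CSV resource and read it straight off the web
csv_resource <- resources %>%
  filter(format == "CSV") %>%
  slice(1)

csv_data <- read_csv(csv_resource$hdx_rel_url)
```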

Example: Exploring the ACLED Conflict Data for Africa

In this example I am going to identify the package “ACLED Conflict Data for Africa (Realtime - 2017)” here. It has an Excel file and a zipped CSV. I’ll then identify the resource for the unzipped Excel file.

Once we have that, we can use httr to download the file, and readxl to load it in.

library(ckanr)
library(tidyverse)
library(jsonlite)
library(readxl)
library(httr)
# First we connect to HDX

hdx_connect()

# We can list all packages available for us

hdx_list(5000)
## # A tibble: 4,905 x 1
##                                                                          value
##                                                                          <chr>
##  1                                       141121-sierra-leone-health-facilities
##  2                                      160516-ecuador-earthquake-4w-1st-round
##  3                                                      160523-ocha-4w-round-2
##  4                                                     160625-hrrp-4w-national
##  5 1999-2013-tally-of-internaly-displaced-persons-resulting-from-natural-disas
##  6                                                                  2011-nepal
##  7                                       2012-census-tanzania-wards-shapefiles
##  8                                        2014-2015-food-security-ipc-analysis
##  9                         2014-nutrition-smart-survey-results-and-2015-trends
## 10                                 2015-humanitarian-needs-overview-indicators
## # ... with 4,895 more rows
# We then search for the package we want, and use dplyr to filter it to the exact package

hum_data_packages <- hdx_package_search("ACLED Conflict Data for Africa") %>%
  filter(title == "ACLED Conflict Data for Africa (Realtime - 2017)")

# We then expand our resources from the search result, and look for the unzipped excel file

hum_data_resources <- hdx_resource_list(hum_data_packages) %>%
  filter(format == "XLSX")
url <- hum_data_resources$hdx_rel_url
GET(url, write_disk("dataset.xlsx", overwrite=TRUE))
## Response [http://www.acleddata.com/wp-content/uploads/2017/06/ACLED-All-Africa-File_20170101-to-20170617.xlsx]
##   Date: 2017-07-03 12:21
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 1.86 MB
## <ON DISK>  C:\Users\callu\Dropbox\blog\content\post\dataset.xlsx
ACLED_CONFLICT_DATA <- read_excel("dataset.xlsx", col_names = TRUE)
rm(url)
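With the file loaded, we can have a quick look at it. This is just a sketch: the upper-case column names (COUNTRY, FATALITIES) are what the 2017 ACLED files used, so check names(ACLED_CONFLICT_DATA) before relying on them.

```r
# Which countries have the most recorded events so far in 2017?
ACLED_CONFLICT_DATA %>%
  count(COUNTRY, sort = TRUE)

# Total recorded fatalities per country
ACLED_CONFLICT_DATA %>%
  group_by(COUNTRY) %>%
  summarise(fatalities = sum(FATALITIES)) %>%
  arrange(desc(fatalities))
```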

Mini Package: hdxr

Mini package - Mini post

I realised when I wrote this post that I should do something more sensible to share the code than putting the text up in a blog. Writing a package seemed overkill though, something you only do for ‘real code’.

But curiosity got the better of me, and after reading the go-to post for package writing, I thought I might as well try.

So here we are: hdxr

It has the functions I mentioned before in the above post

hdx_connect()
hdx_list()
hdx_package_search()
hdx_resource_list()

How to use them is explained in the github repo and the blog post above. There has been no testing whatsoever, and I bet they won’t work first try for you. If you want to let me know how to improve them, or how to use ckanr better, I can be contacted on github or twitter.
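Putting the four functions together, an end-to-end run looks roughly like this. It is untested, and the search term is just an example:

```r
# devtools::install_github("callumgwtaylor/hdxr")
library(hdxr)
library(tidyverse)
library(httr)

hdx_connect()

# Search for a package, keep the first hit, and expand its resources
found <- hdx_package_search("acled") %>%
  slice(1)
resources <- hdx_resource_list(found)

# Download the first resource to disk
GET(resources$hdx_rel_url[1], write_disk("resource_file", overwrite = TRUE))
```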

Plotting deprivation in Scotland, using geofacet and sf in R

Using the geofacet package to plot deprivation in Scotland by Health Boards

A few months ago, when I was first starting to learn to use R, I tried looking at the data from the Scottish Index of Multiple Deprivation. The Scottish Government split Scotland up into 6976 equally populated “data zones” (not quite neighbourhoods, but pretty close), and ranked them from most deprived (1) to least deprived (6976).

Recently I’ve gone back to the same files, to see if what I’ve learnt has made it easier to look at deprivation in Scotland.

This weekend I found out about a new-ish package, geofacet from Ryan Hafen. Luckily for me, Joseph Adams had already submitted a grid for Scottish Health Boards, making it so easy to plot this all out.

I’ve included the code I used to make the plot.

library(tidyverse)
library(sf)
library(readxl)
library(geofacet)
map_scot <- st_read("../data/scot_gov_data/data_zone_shapefiles/.")
## Reading layer `SG_SIMD_2016' from data source `C:\Users\callu\Dropbox\blog\content\data\scot_gov_data\data_zone_shapefiles' using driver `ESRI Shapefile'
## Simple feature collection with 6976 features and 49 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 5513 ymin: 530252.8 xmax: 470323 ymax: 1220302
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +datum=OSGB36 +units=m +no_defs
data_postcode_simd <- read_excel("../data/scot_gov_data/00505244.xlsx", sheet = 2)
data_simd_ranks <- read_excel("../data/scot_gov_data/00512735.xlsx", sheet = 3)

data_simd_ranks$HBname[data_simd_ranks$HBname == "Western Isles"] <- "Western Isle"

Deprivation in Scotland by Health Boards

So looking at this, brighter colours equal lower average levels of deprivation. A distribution towards the left shows a healthboard with a greater proportion of deprived datazones.

Using the geofacet package, we swap out a normal facet_wrap() for facet_geo(), and tell it what layout we want with grid = "nhs_scot_grid". The only problem is that this grid has “Western Isles” saved as “Western Isle”, so we have to rename our own data to match it. Now our health boards are placed in roughly geographical order.
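If you hit a name mismatch like this, geofacet can draw the grid itself so you can check the layout and the exact spellings it expects. A quick sketch, assuming the grid is exported as the nhs_scot_grid object:

```r
library(geofacet)

# Draw the health board grid to inspect its layout and name spellings
grid_preview(nhs_scot_grid)

# Or just list the names the grid expects
nhs_scot_grid$name
```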

data_simd_ranks %>%
  select(DataZone = DZ, HBname) %>%
  left_join(map_scot, .) %>%
  group_by(HBname) %>%
  mutate(median_deprivation = median(Rank)) %>%
  ungroup() %>%
  ggplot() +
  geom_histogram(aes(x = Rank, y = ..ncount.., fill = median_deprivation), binwidth = 100) +
  facet_geo(~HBname, grid = "nhs_scot_grid") +
  labs(title = "Deprivation in Health Boards in Scotland",
       x = "Relative deprivation of data zones (left = more deprived neighbourhoods)",
       y = "Proportion of data zones",
       caption = "Darker colour shows increased average deprivation of healthboard") + 
  theme_bw() +
  theme(legend.position = "none") 

Glasgow versus Edinburgh

In the plot above, the west coast / east coast divide looks pretty big in the central belt of Scotland. The plot below shows this even more. Glasgow’s neighbourhoods are massively skewed towards some of the most deprived parts of Scotland, whereas in Edinburgh we see the opposite. Obviously Glasgow still has some very well-off parts, and Edinburgh has some deprived areas, but in this graph they seem to be polar opposites.

city_data <- data_simd_ranks %>%
  select(DataZone = DZ, LAname) %>%
  left_join(map_scot, .) %>%
  group_by(LAname) %>%
  filter(LAname == "Glasgow City" | LAname == "City of Edinburgh") %>%
  mutate(median_deprivation = median(Rank)) %>%
  ungroup()

city_data$LAname <- parse_factor(city_data$LAname, levels = c("Glasgow City", "City of Edinburgh"))

  ggplot(city_data) +
  geom_histogram(aes(x = Rank, y = ..ncount.., fill = median_deprivation), binwidth = 100) +
  facet_wrap(~LAname) +
  labs(title = "Deprivation in Glasgow and Edinburgh",
       x = "Relative deprivation of data zones",
       y = "Proportion of data zones",
       caption = "Distribution to the left shows increased deprivation in city, darker colour shows increased average deprivation of city") + 
  theme_bw() +
  theme(legend.position = "none") 
